Flink: Project the RowData to remove meta-columns #3240

Reo-LEI · 2021-10-07T12:18:53Z

This PR builds on #2731 and tries to fix #2730. Thanks to @openinx for the contribution.

In this PR, I make RowDataProjection a row data wrapper, as mentioned in this comment #2731 (comment), and support Map and List type projection.

api/src/main/java/org/apache/iceberg/util/StructProjection.java

flink/src/main/java/org/apache/iceberg/flink/data/RowDataProjection.java

Reo-LEI · 2021-10-21T03:59:52Z

@rdblue @kbendick @openinx @stevenzwu Could you take another look at this? 😄

flink/src/main/java/org/apache/iceberg/flink/data/RowDataProjection.java

kbendick · 2021-10-21T19:02:51Z

flink/src/test/java/org/apache/iceberg/flink/data/TestRowDataProjection.java

+            )
+        ))
+    );
+    AssertHelpers.assertThrows("Should be error because cannot project a partial nested list element.",


Nit: This was a little confusing for me at first.

Can we possibly rephrase this as "Should not allow users to project onto a subset of fields of a struct used in a list type"? That would make what's being tested a bit clearer (at least for me) from the get-go.

kbendick · 2021-10-21T21:40:40Z

flink/src/main/java/org/apache/iceberg/flink/data/RowDataProjection.java

+            boolean elementProjectable = !projectedList.elementType().isNestedType() ||
+                projectedList.elementType().equals(originalList.elementType());
+            Preconditions.checkArgument(elementProjectable,
+                "Cannot project a partial list element RowData. Trying to project %s out of %s",


See note below about this exception message.

Here I am trying to keep this message the same as in StructLikeProjection. I feel this message is OK. What do you think?

Ditto : https://github.com/apache/iceberg/pull/3240/files#r736338047

flink/src/main/java/org/apache/iceberg/flink/data/RowDataProjection.java

flink/src/test/java/org/apache/iceberg/flink/TestChangeLogTable.java

flink/src/test/java/org/apache/iceberg/flink/data/TestRowDataProjection.java

stevenzwu · 2021-10-24T05:11:40Z

flink/src/main/java/org/apache/iceberg/flink/data/RowDataProjection.java

+  private static RowData.FieldGetter createFieldGetter(RowType rowType,
+                                                       Types.StructType rowStruct,
+                                                       Types.NestedField projectField) {
+    for (int i = 0; i < rowStruct.fields().size(); i++) {


This loop-based lookup essentially results in n^2 complexity. We can use this API from StructType:

public NestedField field(int id)

Here we need not only to find the row field whose field id equals the projected field id, but also to know the position of the matching field. Even if we get the matching row field via StructType.field(int id), we still need to traverse rowStruct again to find the field's position.

Got it. Can we iterate through the schema once and set up the mapping between field id and position? I have a little performance concern about the n^2 complexity for tables with a lot of columns (like thousands or more).

+1. There are tables with very high cardinality where this will potentially have a real performance impact. This tends to be especially true for base tables (raw ingested data events from clients etc.), which often have very wide schemas and are also an area where Flink is pretty commonly used.

Anything that can be done to reduce this overhead would be great.

I think that is a great idea, let's do this~

I have a little performance concern about the n^2 complexity for tables with a lot of columns (like thousands or more)

I'm fine with either, because the complexity is actually n*m, where n is the number of fields in the table and m is the number of projected fields. If both @stevenzwu and @kbendick think it's necessary, I'm okay with it.

I constructed a fieldIdToPosition map and use StructType.field(int id) to find the row field. The complexity is now reduced to n, so I think performance will not be a problem.
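The field-id-to-position idea discussed in this thread can be sketched roughly as follows. This is a minimal illustration, not the code merged in the PR: the `Field` class is a hypothetical stand-in for an Iceberg nested field, and `FieldPositionIndex`, `indexById`, and `positionsFor` are names invented for the example.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class FieldPositionIndex {
  // Hypothetical stand-in for a struct field: only the id and name matter here.
  static final class Field {
    final int id;
    final String name;

    Field(int id, String name) {
      this.id = id;
      this.name = name;
    }

    @Override
    public String toString() {
      return id + ": " + name;
    }
  }

  // One pass over the full row struct: field id -> position in the row.
  static Map<Integer, Integer> indexById(List<Field> rowFields) {
    Map<Integer, Integer> idToPosition = new HashMap<>();
    for (int pos = 0; pos < rowFields.size(); pos++) {
      idToPosition.put(rowFields.get(pos).id, pos);
    }
    return idToPosition;
  }

  // Resolve every projected field with a map lookup instead of rescanning the row struct,
  // so the work is roughly n + m rather than n * m.
  static int[] positionsFor(List<Field> rowFields, List<Field> projectedFields) {
    Map<Integer, Integer> idToPosition = indexById(rowFields);
    int[] positions = new int[projectedFields.size()];
    for (int i = 0; i < projectedFields.size(); i++) {
      Field projected = projectedFields.get(i);
      Integer pos = idToPosition.get(projected.id);
      if (pos == null) {
        throw new IllegalArgumentException(
            "Cannot locate the project field <" + projected + "> in the row struct");
      }
      positions[i] = pos;
    }
    return positions;
  }
}
```

Building the map once per struct keeps the per-field lookup constant time, which is the property the reviewers asked for on very wide tables.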

openinx · 2021-10-26T07:17:18Z

@Reo-LEI Could you take a look at the checkstyle issue?

Error: eckstyle] [ERROR] /home/runner/work/iceberg/iceberg/flink/src/main/java/org/apache/iceberg/flink/data/RowDataProjection.java:32:8: Unused import - org.apache.iceberg.StructLike. [UnusedImports]

openinx · 2021-10-26T07:21:10Z

I'd like to take a look at this PR today. I think it's a critically important feature for our Flink users to read v2 tables. Thanks @Reo-LEI for picking up this PR!

Reo-LEI · 2021-10-26T07:27:34Z

@Reo-LEI Could you take a look at the checkstyle issue?

Error: eckstyle] [ERROR] /home/runner/work/iceberg/iceberg/flink/src/main/java/org/apache/iceberg/flink/data/RowDataProjection.java:32:8: Unused import - org.apache.iceberg.StructLike. [UnusedImports]

Sure, I will fix this later.

openinx · 2021-10-26T08:32:39Z

flink/src/main/java/org/apache/iceberg/flink/data/RowDataProjection.java

+            RowType nestedRowType = (RowType) rowType.getTypeAt(i);
+            int rowPos = i;
+            return row -> {
+              RowData nestedRow = row.isNullAt(rowPos) ? null : row.getRow(rowPos, nestedRowType.getFieldCount());


Q: If the nestedRow is null, do we still need to traverse the nested fields by using RowDataProjection#project? I think we can just return null as the projected value.

I had a small patch for this:

diff --git a/flink/src/main/java/org/apache/iceberg/flink/data/RowDataProjection.java b/flink/src/main/java/org/apache/iceberg/flink/data/RowDataProjection.java
index 9d1e8ea67..25a5b3ab3 100644
--- a/flink/src/main/java/org/apache/iceberg/flink/data/RowDataProjection.java
+++ b/flink/src/main/java/org/apache/iceberg/flink/data/RowDataProjection.java
@@ -45,7 +45,11 @@ public class RowDataProjection implements RowData {
    * @return a wrapper to project rows
    */
   public static RowDataProjection create(Schema schema, Schema projectedSchema) {
-    return new RowDataProjection(FlinkSchemaUtil.convert(schema), schema.asStruct(), projectedSchema.asStruct());
+    return RowDataProjection.create(FlinkSchemaUtil.convert(schema), schema.asStruct(), projectedSchema.asStruct());
+  }
+
+  public static RowDataProjection create(RowType rowType, Types.StructType schema, Types.StructType projectedSchema) {
+    return new RowDataProjection(rowType, schema, projectedSchema);
   }
 
   private final RowData.FieldGetter[] getters;
@@ -73,9 +77,14 @@ public class RowDataProjection implements RowData {
             RowType nestedRowType = (RowType) rowType.getTypeAt(i);
             int rowPos = i;
             return row -> {
-              RowData nestedRow = row.isNullAt(rowPos) ? null : row.getRow(rowPos, nestedRowType.getFieldCount());
-              return new RowDataProjection(nestedRowType, rowField.type().asStructType(),
-                  projectField.type().asStructType()).wrap(nestedRow);
+              if (row.isNullAt(rowPos)) {
+                return null;
+              } else {
+                RowData nestedRow = row.getRow(rowPos, nestedRowType.getFieldCount());
+                return RowDataProjection
+                    .create(nestedRowType, rowField.type().asStructType(), projectField.type().asStructType())
+                    .wrap(nestedRow);
+              }
             };
 
           case MAP:

I think we cannot return null when the nestedRow is null, because StructProjection still projects the nested struct even if the nested struct is null. If we return null here, the unit test will fail, because the expected record is not null but the actual row data is null.

iceberg/flink/src/test/java/org/apache/iceberg/flink/TestHelpers.java

Line 125 in e4d841b

Assert.assertTrue("expected Record and actual RowData should be both null or not null",
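The two behaviors debated in this thread can be contrasted with a small sketch. It assumes a simplified model in which a row is an `Object[]` and a nested row is either an `Object[]` or null; `NullNestedProjection`, `Projection`, `alwaysWrap`, and `nullAware` are hypothetical names for illustration, not Iceberg or Flink APIs.

```java
final class NullNestedProjection {
  // Hypothetical projecting wrapper: exposes selected positions of a possibly-null row.
  static final class Projection {
    private final int[] positions;
    private Object[] row;

    Projection(int[] positions) {
      this.positions = positions;
    }

    Projection wrap(Object[] newRow) {
      this.row = newRow;
      return this;
    }

    boolean wrapsNull() {
      return row == null;
    }

    Object get(int pos) {
      return row[positions[pos]];
    }
  }

  // Behavior described as matching StructProjection at the time: a wrapper object is
  // returned even when the nested row is null, so the projected value itself is non-null.
  static Projection alwaysWrap(Object[] outer, int nestedPos, int[] positions) {
    return new Projection(positions).wrap((Object[]) outer[nestedPos]);
  }

  // Alternative from the proposed patch: a null nested row yields a null projected value.
  static Projection nullAware(Object[] outer, int nestedPos, int[] positions) {
    Object nested = outer[nestedPos];
    return nested == null ? null : new Projection(positions).wrap((Object[]) nested);
  }
}
```

With the first variant the expected side of a comparison is a non-null wrapper, which is why a test that requires "both null or both not null" would fail if only the Flink side returned null.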

openinx · 2021-10-26T08:34:59Z

flink/src/main/java/org/apache/iceberg/flink/source/RowDataFileScanTaskReader.java

+
+    // Project the RowData to remove the extra meta columns.
+    if (!projectedSchema.sameSchema(deletes.requiredSchema())) {
+      RowDataProjection rowDataProjection = RowDataProjection.create(deletes.requiredSchema(), projectedSchema);


I see the RowDataProjection#create does a FlinkSchemaUtil.convert(schema) for the required schema to project, and I believe the FlinkDeleteFilter also did the same thing inside. I think we can reuse the converted flink row type between them.

Good point! Now I get the row type from FlinkDeleteFilter and pass it to RowDataProjection.

flink/src/main/java/org/apache/iceberg/flink/source/RowDataFileScanTaskReader.java

openinx · 2021-10-26T08:55:25Z

flink/src/main/java/org/apache/iceberg/flink/data/RowDataProjection.java

+    return this;
+  }
+
+  public Object getValue(int pos) {


Nit: this can be a private method, right ?

openinx · 2021-10-26T09:03:06Z

flink/src/main/java/org/apache/iceberg/flink/data/RowDataProjection.java

+        }
+      }
+    }
+    throw new IllegalArgumentException(String.format("Cannot find field %s in %s", projectField, rowStruct));


Nit: I think we need a clearer message for this exception: Cannot locate the project field <%s> in the iceberg struct <%s>

openinx · 2021-10-26T09:06:47Z

flink/src/main/java/org/apache/iceberg/flink/data/RowDataProjection.java

+    for (int i = 0; i < rowStruct.fields().size(); i++) {
+      Types.NestedField rowField = rowStruct.fields().get(i);
+      if (rowField.fieldId() == projectField.fieldId()) {
+        Preconditions.checkArgument(rowField.type().typeId() == projectField.type().typeId(),


Nit: this can be simplified by the following lines as the Preconditions.checkArgument can format the error message directly.

Preconditions.checkArgument(rowField.type().typeId() == projectField.type().typeId(), "Different iceberg type between row field <%s> and project field <%s>", rowField, projectField);
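To illustrate why passing the template and arguments directly is simpler than pre-formatting the message, here is a minimal stand-in for a Guava-style `checkArgument`. `PreconditionSketch` is a hypothetical helper written for this example; it uses `String.format` for brevity, which is an assumption and not how the real library formats its templates.

```java
final class PreconditionSketch {
  // Stand-in for a Guava-style precondition: the message is only built when the check fails,
  // so callers do not pay for String.format on the happy path.
  static void checkArgument(boolean expression, String template, Object... args) {
    if (!expression) {
      throw new IllegalArgumentException(String.format(template, args));
    }
  }

  // Example caller: compares two type names and reports both sides on mismatch.
  static void checkSameType(String rowFieldType, String projectFieldType) {
    checkArgument(rowFieldType.equals(projectFieldType),
        "Different iceberg type between row field <%s> and project field <%s>",
        rowFieldType, projectFieldType);
  }
}
```

The caller stays a single statement, and the arguments are only rendered into the message if the condition is false.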

openinx · 2021-10-26T09:30:57Z

flink/src/main/java/org/apache/iceberg/flink/data/RowDataProjection.java

+            boolean valueProjectable = !projectedMap.valueType().isNestedType() ||
+                projectedMap.valueType().equals(originalMap.valueType());
+            Preconditions.checkArgument(keyProjectable && valueProjectable,
+                "Cannot project a partial map key or value RowData. Trying to project %s out of %s",


We should say "Cannot project a partial map key or value with non-primitive type. Trying .." instead. A failed check does not necessarily mean the key or value is a RowData; it can be another nested type such as a list or map.
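The rule behind this check (a key or value is projectable if it is primitive, or if the projected type equals the original type exactly) can be sketched as below. The `Type` class is a hypothetical, simplified model with just a description and a nested flag; it is not the Iceberg type hierarchy, and the exact message wording is only an example following the reviewer's suggestion.

```java
import java.util.Objects;

final class PartialProjectionCheck {
  // Hypothetical simplified type: a description plus whether it is nested (struct, list, map).
  static final class Type {
    final String description;
    final boolean nested;

    Type(String description, boolean nested) {
      this.description = description;
      this.nested = nested;
    }

    @Override
    public boolean equals(Object other) {
      if (!(other instanceof Type)) {
        return false;
      }
      Type that = (Type) other;
      return nested == that.nested && description.equals(that.description);
    }

    @Override
    public int hashCode() {
      return Objects.hash(description, nested);
    }

    @Override
    public String toString() {
      return description;
    }
  }

  // Primitive types are always projectable; nested types only when nothing was pruned.
  static boolean projectable(Type projected, Type original) {
    return !projected.nested || projected.equals(original);
  }

  // Rejects a map projection that prunes fields inside a nested key or value type.
  static void checkMapProjection(Type projectedKey, Type originalKey,
                                 Type projectedValue, Type originalValue) {
    boolean keyProjectable = projectable(projectedKey, originalKey);
    boolean valueProjectable = projectable(projectedValue, originalValue);
    if (!(keyProjectable && valueProjectable)) {
      throw new IllegalArgumentException(String.format(
          "Cannot project a partial map key or value with non-primitive type. "
              + "Trying to project <%s, %s> out of <%s, %s>",
          projectedKey, projectedValue, originalKey, originalValue));
    }
  }
}
```

A map of string to a full struct passes, while the same map with a pruned struct value is rejected, regardless of whether the nested value is a struct, list, or map.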

Reo-LEI · 2021-10-27T12:47:09Z

I addressed some comments and left others open for discussion. I think you can take another look at this PR. @rdblue @openinx @stevenzwu @kbendick 😄

openinx · 2021-11-01T05:12:39Z

Let me take another look today! Thanks @Reo-LEI for the update.

openinx · 2021-11-01T05:30:03Z

flink/src/main/java/org/apache/iceberg/flink/data/RowDataProjection.java

+  /**
+   * Creates a projecting wrapper for {@link RowData} rows.
+   * <p>
+   * This projection does not work with repeated types like lists and maps.


Should this read "This projection does not work with repeated types like lists and maps with nested children types"? I think it works fine for lists and maps with primitive children types.

It's better to say: "Projecting a partial map key or value with non-primitive type does not work in this projection wrapper".

That is exactly what this comment means: the projection will not project the nested children types of repeated types. I will rephrase it.

openinx · 2021-11-01T06:36:58Z

flink/src/main/java/org/apache/iceberg/flink/data/RowDataProjection.java

+      if (rowField == null) {
+        throw new IllegalArgumentException(String.format(
+            "Cannot locate the project field <%s> in the iceberg struct <%s>", projectField, rowStruct));
+      }


Nit:

Preconditions.checkNotNull(rowField, "Cannot locate the project field <%s> in the iceberg struct <%s>", projectField, rowStruct);

openinx

Looks good to me overall, thanks @Reo-LEI for the great contribution and thanks @stevenzwu for the double check. I left several minor comments.

Reo-LEI · 2021-11-01T07:27:45Z

I addressed the rest of the comments just now, so you can check this again @openinx. Thanks @openinx @stevenzwu @kbendick @rdblue for the review.

kbendick · 2021-11-01T23:41:59Z

Sorry for missing some pings. Was out of office for a few weeks a while back and have still been playing a bit of catch up.

Please feel free to message me on slack if it's urgent btw. But retroactive +1. 🙂

openinx and others added 7 commits June 24, 2021 20:16

Flink: Project the RowData from DeleteFilter to remove meta-columns.

37db27a

Address the nested projection issues

071079b

Merge remote-tracking branch 'community/master' into project-row-data

8a6a6f0

# Conflicts: # flink/src/main/java/org/apache/iceberg/flink/source/RowDataFileScanTaskReader.java # flink/src/test/java/org/apache/iceberg/flink/TestHelpers.java

Merge branch 'apache:master' into flink-project-row-data

b9e95c6

Merge branch 'apache:master' into flink-project-row-data

a3b3741

Merge remote-tracking branch 'community/master' into flink-project-row-data

71e1938

Make RowDataProjection as row data wrapper and support project nested field type.

b29c56f

github-actions bot added API flink labels Oct 7, 2021

Reo-LEI mentioned this pull request Oct 7, 2021

Flink: Project the RowData from DeleteFilter to remove meta-columns. #2731

Closed

Fix checkstyle.

82ea9e5

rdblue reviewed Oct 8, 2021

api/src/main/java/org/apache/iceberg/util/StructProjection.java

kbendick reviewed Oct 18, 2021

flink/src/main/java/org/apache/iceberg/flink/data/RowDataProjection.java

Reo-LEI and others added 3 commits October 21, 2021 11:46

Merge branch 'apache:master' into flink-project-row-data

c69608c

Remove the commented out code.

34670be

Merge branch 'flink-project-row-data' of https://github.com/Reo-LEI/iceberg into flink-project-row-data

e3a73da

kbendick reviewed Oct 21, 2021

stevenzwu reviewed Oct 24, 2021

flink/src/main/java/org/apache/iceberg/flink/data/RowDataProjection.java

stevenzwu reviewed Oct 24, 2021

flink/src/test/java/org/apache/iceberg/flink/TestChangeLogTable.java

stevenzwu reviewed Oct 24, 2021

flink/src/test/java/org/apache/iceberg/flink/TestChangeLogTable.java

stevenzwu reviewed Oct 24, 2021

flink/src/test/java/org/apache/iceberg/flink/TestChangeLogTable.java

stevenzwu reviewed Oct 24, 2021

flink/src/test/java/org/apache/iceberg/flink/data/TestRowDataProjection.java

stevenzwu reviewed Oct 24, 2021

Reo-LEI and others added 2 commits October 25, 2021 17:39

Merge branch 'apache:master' into flink-project-row-data

35eda4c

Adressing some comments.

b14ace1

openinx reviewed Oct 26, 2021

Adressing some comments.

4e7b786

openinx reviewed Nov 1, 2021

openinx approved these changes Nov 1, 2021

Adressing some comments.

debc299

openinx merged commit b4ac277 into apache:master Nov 1, 2021

openinx mentioned this pull request Nov 1, 2021

Using Kafka to insert multiple pieces of data with the same primary key value in Iceberg at one time, the data cannot be queried #2627

Closed

Reo-LEI deleted the flink-project-row-data branch November 1, 2021 09:04

kbendick pushed a commit to kbendick/iceberg that referenced this pull request Nov 2, 2021

Flink: Project the RowData to remove meta-columns (apache#3240)

3a18237

kbendick mentioned this pull request Nov 2, 2021

Investigate amount of work needed to backport #3240 to 0.12.1 #3443

Closed

rdblue mentioned this pull request Nov 3, 2021

What's the correct semantic when projecting a required nested field from an optional struct ? #2738

Closed

openinx mentioned this pull request Mar 29, 2022

Flink: Add unit test to guarantee v2/v1 table without any deletes won't project to exclude the meta-columns. #3431

Closed

stevenzwu mentioned this pull request May 6, 2023

API, Flink: StructProjection returns null projection object for null nested struct value #7517

Merged
